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Abstract 

We present a block-based, automatic partitioning and scheduling method- 
ology for sparse matrix factorization on distributed memory systems. Using 
experimental results, we analyze this technique for communication and load 
imbalance overhead. To study the performance effects, we compare these over- 
heads with those obtained from a straightforward “wrap-mapped” column as- 
signment scheme. All experimental results were obtained using test sparse 
matrices from the Harwell-Boeing data set. The results show that there is a 
communication and load balance trade-off. The block-based method results in 
lower communication cost whereas the wrap-mapped scheme gives better load 
balance. 
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1 Introduction 


Partitioning and scheduling the parallel execution of large scientific applications on 
distributed memory systems is a difficult and time consuming task. If the dependen- 
cies involved are unstructured, as in the case of sparse linear systems, then the task 
becomes even more complex. Use of naive techniques to extract parallelism often 
results in large communication overhead and/or in large load imbalance. To reduce 
communication overhead, locality of data must be exploited and to balance the load, 
the computations must be evenly distributed at all times. When the data depen- 
dencies are non-uniform and unstructured, achieving these two goals simultaneously 
is difficult. As a result, in such cases, the overall performance may turn out to be 
poor, even if an application has a high degree of extractable parallelism. One possible 
way to minimize the overhead is to make use of the structure of the sparse system 
which can usually be determined prior to performing the numerical computations. 
When direct methods are used to solve the sparse systems, this information in the 
form of the structure of the factored matrix is routinely used to reduce computation 
and/or storage costs. Recently, this information has also been applied in extracting 
parallelism while maintaining low communication and load imbalance costs [5], [6], 
[14]. However, in most cases, parallelism has been extracted manually, which tends 
to be extremely tedious, error prone, and inflexible. Thus, automation is the key to 
successful parallelization of such applications. To summarize, there are two important 
issues in the efficient parallelization of sparse matrix based computations: 

• Developing technology for the automatic parallelization of the computations. 

• Developing a methodology for the extraction of the available parallelism with 
minimum communication and load imbalance costs. 

To address these issues, we have developed an automatic, block-based scheme for par- 
titioning and scheduling the computations in factoring a sparse matrix. The scheme 
makes use of the structure of the factor and is targeted towards distributed memory 
systems. To reduce communication, it takes advantage of locality. However, to main- 
tain proper load balance and a high degree of parallelism, the scheme makes use of 
an adaptive technique in distributing the computational work. 

To demonstrate the usefulness of such a partitioning scheme and to bring out the 
performance limitations that are inherent in sparse matrix computations, we compare 
the communication overhead and the degree of load balance in the automated block- 
based approach with that obtained from a straightforward and widely used column- 
based approach. In the latter scheme, computations associated with an entire column 
or row are assigned to a processor and the assignment of these columns or rows is 
usually done in a “wrap-around” fashion. We refer to this scheme as the wrap-mapping 
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or wrap scheme. For comparing the performance on practical applications, we present 
results for some of the Harwell- Boeing test matrices. 

In the following discussion, it is assumed that the reader is familiar with the standard 
terminology used in the context of sparse matrix computations. For an explanation, 
see [7], [3]. 

The organization of the rest of the paper is as follows. In the next section, the 
Cholesky factorization is briefly described and some of the terminology used in the 
paper is introduced. The partitioning and scheduling strategies that are used for 
automation are presented in Section 3. Performance results are described in Section 4 
and Section 5 concludes the paper. 


2 Cholesky factorization 

The partitioning and scheduling methodology is described in this paper assuming 
Cholesky factorization as the model numerical method of computation. The Cholesky 
algorithm is simple, well understood, and is widely used. Note, however, that the 
techniques presented here are applicable to other factoring methods as well. In the 
following, we highlight only those aspects of this algorithm that are essential for 
describing the partitioner. For details on the Cholesky factorization scheme, see [9]. 

For the sake of completeness, we first briefly describe the four steps involved in the 
direct solution of ,4x = b. (For details see, for example, [8].) It is assumed that A 
is symmetric, positive definite and that Cholesky factorization is used in computing 
the factor L, where A = LL T . 

1. Ordering : Find a good ordering of the unknowns for elimination. The ordering 
is given by a permutation matrix P. Most often, a “good” ordering implies 
one which would lead to a sparse factor and fewer arithmetic operations in the 
numerical factorization step. 

2. Symbolic Factorization : Determine the sparsity structure of the factor L . 

3. Numerical Factorization : Compute L. 

4. Triangular Solutions : Using the computed L, solve the triangular systems Lu — 
P b, L t v = u and set x = P T v. 


The basic element-level data dependencies in the factorization process are shown in 
Figure 1. 
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Figure 1: Inter-element dependencies in Cholesky factorization 

In that figure, only the lower triangular part of the matrix to be factored is shown. 
Lij denotes the element in row i and column j . The direction of the arrows indicates 
the data flow. Thus, elements Lj,k and L^k from column k of the factor L are required 
in computing element Uj. L itj = L itj - L i<k * L jtk is the corresponding operation in 
the Cholesky factorization. (Initially L itj is set to A itj .) We refer to this operation as 
a single update operation. Note that in computing the final value of Lij, it must be 
updated by all pairs of non-zero elements Lj k and L i k , 1 < k < j . Finally, after all 
the updates are performed, the element is scaled by the square root of the diagonal 
element in that column. 


3 Partitioning and scheduling 


The partitioning scheme presented here is static in the sense that all the computations 
are partitioned before any of the computations are scheduled for execution. For this, 
the partitioner takes as an input the structure of the factor for the sparse matrix. 
However, the scheme is general and does not have knowledge of any matrix structure 
embedded in it. 

As stated in the introductory section, the aim of the partitioner and the scheduler 
is to reduce communication and at the same time maintain a balanced work load 
among processors at all times. To achieve this, wherever possible, data locality is 
exploited. This leads to some variation of block-based partitioning; such partitioning 
approaches have been proposed in several linear algebra related problems [2], [12]. 
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With blocking, it is possible to achieve a high ratio of computation to communication 
per block. In [11], it is shown that for an important class of problems, the block-based 
partitioning schemes result in an optimal utilization of the data accessed (and thus 
reduce data traffic). Blocking, however, could lead to load imbalance because the 
increase in the size of schedulable units results in a loss of flexibility in distributing 
work among processors. To avoid this, the partitioner described here partitions the 
factored matrix into blocks of varying sizes that can be assigned in an equitable 
manner to the processors. It makes use of a heuristic where the block sizes are 
subject to adaptive manipulation. In the following we describe the functioning of the 
partitioner in some detail. 

The partitioning starts with the zero-nonzero structure of the filled sparse matrix 
obtained after the symbolic factorization phase has been completed. Blocks of non- 
zero areas are identified in the filled matrix. We refer to these as dense blocks. On 
occasions, blocks are formed by including small regions that correspond to zeros in 
the factored matrix in order to obtain larger blocks. Inclusion of such areas with zero 
elements is kept to a minimum. The work in these dense blocks is partitioned into 
sub-blocks which are the basic schedulable units. These unit blocks have a regular 
shape - each unit block is either a column, a rectangle or a triangle. After all the unit 
blocks are identified, the dependencies between these blocks are determined. Finally 
the unit blocks are assigned and scheduled on processors. 

Thus, the steps involved in the automatic partitioning and scheduling are: 

• Identify dense blocks in the symbolic factor. 

• Partition each dense block into schedulable unit blocks. 

• Generate and store dependency information for the unit blocks. 

• Schedule these units on the processors of a message passing system. 

• Consolidate the non-local memory access information for each processor so as 
to minimize communication overhead. 

In the remainder of this section, we will describe the first four steps. 


3.1 Identification of dense blocks 

To identify the dense blocks, first clusters of columns are determined in the sparse 
triangular factor. A cluster is a either a column or a strip of consecutive columns. 
If it is a strip, it contains a dense triangular block at the top and (possibly) a set of 
off-diagonal dense rectangular blocks. This is illustrated using an example shown in 
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Figure 2. In that figure, non-zero elements in the filled 41 x 41 matrix are indicated 
by the dark areas. The matrix corresponds to a 5-point finite element 5x5 grid and is 
ordered using Liu’s multiple minimum degree algorithm [10]. It was generated using 
the Sparse Matrix Manipulation System developed at the University of Wisconsin [1]. 



Figure 2: A 41 x 41 filled matrix. 

In Figure 2, note the following in the lower triangular part. Cluster 1 spans columns 
1 and 2 and cluster 2 spans columns 3 and 4. Both clusters 1 and 2 have a three- 
element dense triangular block at the diagonal. Cluster 1 has three dense rectangles 
below the triangle, each of which is 1 x 2, while cluster 2 has two dense rectangles, the 
upper one being 1x2 and the lower one being 2x2. Clusters 3 through 12 are single 
columns starting with cluster 3 at column 5. The last cluster consists of columns 35 
through 41. This cluster has one dense triangle and no rectangles below it. Note that 
in this illustration we do not consider column 34 as part of the last cluster because 
of the zero in row 38 of this column. But this can be over-ridden by allowing some 
zeros to be a part of a triangle. 

Once the clusters and the triangular and rectangular blocks within each cluster are 
identified, the algorithm processes the clusters left to right in the matrix. When a 
cluster is processed, each block in the cluster is partitioned into sub-blocks which are 
schedulable units. Next, for each unit, the dependencies are determined and stored. 
These steps are explained below. 
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3.2 Partitioning of a block 


A cluster with a single column is considered to be a schedulable unit and is not 
subject to further partitioning. In a multi-column cluster, the triangular block is 
partitioned first. In general, the number of partitions of a triangle are determined by 
(a) the number of processors that are assigned to the blocks on which the triangle 
depends, (b) a certain minimum work requirement per unit sub-block. The first pa- 
rameter restricts communication to the group of processors that work on the triangle 
and its predecessors. The second parameter is used to ensure a satisfactory ratio of 
computation to communication for each unit block and is an architecture dependent 
parameter. This parameter may be used to vary block sizes from one cluster to the 
next. For the results presented here we use a fixed size - one for all the triangular 
block and another for the rectangular blocks. This is referred to as the grain size and 
is the minimum number of matrix elements required in each unit block. The grain 
size dictates a maximum number of partitions, say P^. A block is partitioned into at 
most Pd equal sized units; at most because it may not always be possible to break up 
a block into exactly Pd equal sized units. 



Figure 3: Partitions 

Figure 3 illustrates this partitioning. The triangle is partitioned into six parts. One 
of the rectangles is partitioned into four parts and the other is partitioned into three 
parts. 
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3.3 Identification of dependencies 


The dependencies in a single update operation at the element level of Cholesky fac- 
torization are shown in Figure 1. However, for allocation and scheduling of the units, 
it is necessary to identify the dependencies at the block level. In this step, for each 
unit block, the dependencies are determined and the information on the actual data 
needed in the update operation is stored. This step also identifies columns or block 
units that are independent, i.e., those that are not updated by any other units. To 
automate this process, it is necessary to classify the dependencies at the inter-block 
level. We have classified these dependencies into ten categories which are enumerated 
next. Using this classification and the interval tree structure, the partitioner computes 
the dependencies efficiently. The implementation details are given elsewhere. 

In the following discussion, a column is represented by its column number in the 
matrix, a rectangle is represented by its column extent (cj,c^), Ci < Cj and row extent 
i r pi r q)i r p < r 9 > an( l a tr i an gl e is represented by its row extent (or column extent, 
which is the same as the row extent) {r h rj), Ti < Tj. 

1. A column updates a column 

This forms the base case for the dependencies. A column k updates a column 
j if Lj t k is non-zero, (see Figure 1). 

2. A column updates a triangle 

Let triangle T’s row extent be (r x ,r 2 )- A column k, k < r x , updates the triangle 
if L i<k is non-zero, n < i < r 2 . In Figure 4(a), the non-zero elements of column 
k that are involved in the update are in rows i x , i 2 and i 3 . The points of 
intersection of the dotted lines with each other and of the dotted lines with the 
diagonal are the points of triangle T that are updated by column k. 

3. A column updates a rectangle 

Let rectangle R's column extent be (cj, c 2 ) and row extent be (ri, r 2 ). A column 
k updates this rectangle if it has pairs of non- zero elements Li t k and Lj,fc, where 
d <i < c 2 and r x < j < r 2 . In Figure 4(b), the non-zero elements in rows i x 
and iz combine with the non-zero elements in rows j x , j 2 and j 3 to update a 
portion of R. This updated portion is the set of points given by the intersection 
of the dotted lines in R' s interior. 

4. A triangle updates a rectangle 

Let the column extent of rectangle R be (ci, c 2 ) and the column extent of triangle 
T be (c 3 ,c 4 ). The triangle updates the rectangle if there is an intersection in 
their column extents. In Figure 4(c), the shaded portion of T updates the 
shaded portion of R. 
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5 . A triangle and a rectangle update a rectangle 

Let rectangle Ri have column extent (ci,c 2 ) and row extent (ri,r 2 ) and let 
rectangle R 2 have column extent (03,04) and row extent (r 3 ,r 4 ). Let c 2 < c 3 . 
Let the column extent of triangle T be (c 5 ,C6). T combines with R\ to update 
R 2 if (ci,c 2 ) intersects (c 5 ,c 6 ), (c 3 ,c 4 ) intersects (c 5 ,C6) and (r l5 r 2 ) intersects 
( r 3 ) ^4) • In Figure 4 (d), the shaded rectangular portion of T combines with the 
entire shaded rectangle to update the entire shaded rectangle R 2 . 

6. A rectangle updates a column 

Let the row extent of rectangle R be (ri,r 2 ). It updates a column k if r x < k < 
r 2 . In Figure 4 (e), the shaded portion of the rectangle between rows k and r 2 
update the column elements between rows k and r 2 . 

7 . Two rectangles update a column 

Let rectangle Ri have column extent (ci,c 2 ) and row extent (r 1 ,7’ 2 ) and let 
rectangle R 2 have column extent (c 3 ,c 4 ) and row extent (r 3 ,r 4 ). Let r 2 < r 3 . 
Then R\ combines with R 2 to update a column k if 7*1 < k < r 2 and (ci,c 2 ) 
intersects (c 3 ,c 4 ). In Figure 4 (f), the elements of R\ which are in the row k 
between the vertical dotted lines combine with the entire shaded rectangle R 2 
to update the elements between rows r 3 and r 4 in column k. 

8. A rectangle updates a triangle 

Let the row extent of rectangle R\ be (ri,r 2 ) and the row extent of triangle T 
be (r 3 ,r 4 ). The rectangle updates the triangle if (r 4 ,r 2 ) intersects (r 3 ,r 4 ). In 
Figure 4 (g), the shaded portion of R updates the shaded portion of T. 

9 . Two rectangles update a triangle 

Let rectangle R\ have column extent (ci,c 2 ) and row extent (T*i,r 2 ) and let 
rectangle R 2 have column extent (c 3 ,c 4 ) and row extent (r 3 , r 4 ). Let r 2 < r 3 . 
Let the row extent of triangle T be (r 5 ,r a ). Then R\ combines with R 2 to 
update T if (ci,c 2 ) intersects (c 3 ,c 4 ) and (r 4 ,r 2 ) intersects (r 6 ,r6) and (r 3 ,r 4 ) 
intersects (r 5 ,r 6 ). In Figure 4 (h), the shaded portion of Ri combines with the 
entire shaded rectangle R 2 to update the shaded rectangular portion of T. 

10 . Two rectangles update a rectangle 

Let rectangle Ri have column extent (ci, c 2 ) and row extent (r 2 , r 2 ), rectangle R 2 
have column extent (c 3 , c 4 ) and row extent (r 3 , r 4 ) and rectangle f? 3 have column 
extent (c 5 , c 6 ) and row extent (r 5 , r 6 ). Let r 2 < r 3 , r 2 < r 5 and c 4 < c 5 . Then R x 
combines with R 2 to update R 3 if (ci, c 2 ) intersects (c 3 , c 4 ) and (r 3 , r 4 ) intersects 
(r 5 ,r a ) and (ri,r 2 ) intersects (c 5 ,c 6 ). In Figure 4 (i), the shaded portion of Ri 
combines with the shaded portion of R 2 to update the shaded part of R 3 . 
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3.4 Scheduling 


The scheduling process is split up into two parts: allocating unit blocks to processors 
and ordering the computational work within each processor. In this paper, we are 
concerned with the first part only and the salient points therein are presented next. 

First the independent columns, as identified in the previous step, are allocated to 
processors in a wrap-around fashion. The remaining clusters are scanned again from 
left to right. If a cluster is a dependent column, the entire column is allocated to a 
processor, which is arbitrarily picked from the set of processors which worked on the 
column’s predecessors. If the cluster is not a column, the unit blocks in the triangular 
part are allocated to processors, followed by the unit blocks in each rectangular block, 
going top to bottom. For example, in the cluster shown in Figure 3 , the six sub-blocks 
of the triangle would be allocated first, followed by the four sub-blocks of the rectangle 
below it, finishing up with the three sub-blocks of the bottom-most rectangle. 

Allocation within a triangle proceeds by first allocating the triangular units from 
top to bottom, followed by the rectangular units, going top to bottom and left to 
right. In the Figure 3 for instance, the sub-blocks in the triangle would be allocated 
in the order < 3 , f 6 > i 2 , £4, ^5- A global set of all processors, P g) is maintained, 
with a marker pointing to the first “available” processor. This marker cycles through 
the global set in a round-robin fashion and is moved up every time a unit block is 
allocated to the currently available processor. Apart from this, a set of processors, P a , 
which have been already allocated to some sub-block in the triangle is maintained. 
Initially, P a is empty. The strategy for allocating a processor to a unit rectangle or 
unit triangle is the same. First, the predecessors of the unit block are scanned. For 
each predecessor, if the processor p which worked on it is not in P a , the unit block 
is allocated to p and p is added to P a . If all of the processors which worked on all 
the predecessors of the unit block are already in P a , the unit block is allocated to the 
currently available processor in P g and the marker is moved up to the next processor 
in P g . 

For allocating the units within a rectangle below the triangle, the choice of processors 
is restricted to P t , where P t is the set of processors to which the unit blocks in the 
triangle are allocated. Since there is a large amount of communication between a tri- 
angle and the rectangles below it, this strategy helps in reducing the communication. 
First, the processors in set P t are ordered according to increasing work. Going in 
round-robin fashion through P f , the processors are assigned to the unit blocks in the 
rectangle, going top to bottom and left to right within the rectangle. For example, 
let processors p 1? p 2 and p 3 be assigned to the unit blocks on the triangle in Figure 3 . 
Assume that the ordering according to work is such that p\ < p 2 < p 3 . Then, in 
the first rectangle below the triangle, r u is allocated to pi, r \ 2 is allocated to p 2 , ^13 
is allocated to p 3 , r i4 is allocated to pi. The set P t is sorted again and the above 
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strategy is used to allocate 7 * 21 , 7*22 and 7 * 23 . 


4 Performance 


In this section we present results on the performance of the above described partitioner 
and scheduler, in terms of the quality of partitioning and allocation that it produces. 
To quantify the results, we measure the communication overhead in terms of the total 
data traffic generated and the load balance in terms of a factor that measures the 
deviation from perfect load balance. We also compare the results with those using 
the straightforward column wrap assignment scheme. For this purpose, we have 
used some of the representative test matrices from the Harwell-Boeing package [4]. 
These test matrices were partitioned and the work units were scheduled as described 
in the previous section. Using this output, simulations were carried out to get the 
performance results presented here. 


Application 

No. of 
eqns. 

No. of 
non-zeros 

No. of 
non- zeros 
in factor 

Description 

BUS1138 

1138 

2596 

3304 

Symmetric structure of power 
system networks 

CANN1072 

1072 

6758 

20512 

Symmetric pattern from 
Cannes, Lucien Marro 

DWT512 

512 

2007 

3786 

Symmetric submarine frame 
from Naval Ship Research 
and Development Center 

LAP30 

900 

4322 

16697 

Symmetric matrix representing 
9-point discretization of the 
Laplacian on the unit square w/ 
Dirichlet boundary conditions 

LSHP1009 

1009 

3937 

18268 

Symmetric matrix from 
Alan George’s LSHAPE probs. 


Table 1: Selected Harwell-Boeing Test Matrices 

For all the results presented in this section, the test matrices were ordered using Liu’s 
modified multiple minimum degree ordering scheme [10]. We used some of the tools 
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from SPARSKIT [13] and the Wisconsin Sparse Matrix Manipulation System [1] for 
converting the test matrices into various formats, and for ordering and symbolically 
factoring the matrices. Table 1 describes the Harwell-Boeing test matrices which were 
used in our experiments. 

In the following, we first quantify the communication and work load distribution 
aspects of the partitioning schemes. Note that here we are concerned with the quality 
of the partitioner/scheduler in distributing the work among the processors and hence 
do not take into account data dependency delays. In practice, the total execution may 
be affected by the dependency delays as well. However, if the number of processors 
is relatively small compared to the number of schedulable units, then the allocation 
scheme described here provides enough parallelism to keep the idle time to a minimum. 

The communication cost is parameterized by the total data traffic generated in the 
system and the mean data traffic per processor. The data traffic is defined as a count 
of all the non-local data accesses. Accessing a single non-local element constitutes a 
unit data traffic irrespective of the location from where it is fetched. Once a data 
element is fetched, that element is stored locally and subsequent usage of that element 
in the local computations does not add to the data traffic. The total data traffic in 
the system is the sum of the data accesses by all the processors in the system. This 
figure represents the volume of the data that must be transmitted by the system 
during the entire factorization step. 

The work load distribution of a partitioning scheme is characterized as follows. The 
computation cost of updating an element of the matrix by a pair of off-diagonal 
elements is assumed to be two units; updating the element by the diagonal element is 
assumed to cost one unit. The computational work assigned to a processor is the sum 
of the computation costs of all the elements assigned to that processor. The quality 
of the work load distribution for a partitioning scheme is measured in terms of the 
load imbalance resulting from the assignment of the work to the processors. The load 
imbalance factor is defined as, 

. _ (W max - W ave ) * N 
W tot 

where W to t is the total work, N is the number of processors, W ave = W to t/N is the 
average work and W max is the maximum work assigned to any processor. Note that 
when the load is perfectly distributed, W max is W ave and A is zero. The load imbalance 
factor can be related to the efficiency e, which is the ratio of speedup to number of 
processors, where speedup is the ratio of sequential time to parallel time. In the case 
of zero idle times due to dependency delays, the parallel time is simply the amount 
of computational work in the processor with the maximum work. The efficiency can 
then be expressed as, 

W M _ W ave 
W * N W 

rr max ^ vv max 


ll 



which gives us 


(Wmax ~ Wave) * N = - Wq„e = 1 _ 

Wiot W ave e 


Table 2 gives the communication traffic in the block scheme for two cases respectively: 
when the grain size is 4 and when the grain size is 25. 


Appl. 

P 

Total 

Mean 

g =4 

g=25 

II 

g=25 


4 

1335 

1194 

334 

298 

BUS 

16 

1818 

1567 

114 

98 


32 

1910 

1649 

60 

103 


4 

47545 

40716 

11886 

10179 

CANN 

16 

138453 

80334 

8653 

5021 


32 

171965 

89042 

5374 

2783 


4 

5336 

3768 

1334 

942 

DWT 

16 


5482 

645 

342 


32 

■ 

5950 

353 

185 


4 

38424 

29382 

9606 

7346 

LAP 

16 

100012 

44738 

6251 

2796 


32 

113717 

48863 

3554 

1527 


4 

42044 

29899 

10511 

7475 

LSHP 

16 

106973 

57773 

6686 

3611 


32 


B 


1883 


Table 2: Block mapping communication. 

Recall that the grain size is the minimum number of elements in any triangular or 
rectangular partition. In both cases, total communication increases with the number 
of processors for all the test problems. However, when the grain size is increased from 
4 to 25, there is a significant reduction in communication. For instance, in the LAP30 
problem, the g = 4 and g = 25 columns for total communication in table 2 show that 
there is more than 50% reduction in the total communication for p = 16 and p = 32. 
This is due to the fact that as the block size increases, more work is done in each 
block with a lot of re-use of data. 

Table 3 describes the work distribution in the block scheme for grain sizes 4 and 25. 
In contrast to the reduction in communication with higher grain size, in most cases, 
there is an increase in load imbalance. Furthermore, the load imbalance factor A 
increases, in general, with the number of processors, as well. 
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Overall, the larger the grain size, the smaller is the communication, at the cost of 
larger load imbalance. If the application is run on a system with high communication 
cost as compared to computation cost, the block-based partitioning can give good 
performance i.e. the savings in communication will be more than offset the disad- 
vantage of load imbalance. Also, the load balance can be improved by using more 
sophisticated strategies to allocate blocks to processors. 


Appl. 

Procs. 

Work Distribution 

Mean 

A 

g =4 

g=25 


j 4 

2791 

0.77 

0.8 

BUS 

16 

698 

3.59 

3.59 


32 

349 

6.3 

6.3 


4 

151460 

0.07 

0.122 

CANN 

16 

37865 

0.13 

0.62 


32 

18932 

0.38 

1.26 


4 

11701 

0.17 

0.18 

DWT 

16 

2925 

1.14 

1.37 


32 

1462 

1.48 

3.67 


4 

108644 

0.12 

0.16 

LAP 

16 

27161 

0.13 

1.13 


32 

13581 

0.48 

2.9 


4 

125392 

0.06 

0.24 

LSHP 

16 

31348 

0.25 

0.74 


32 

15674 

0.24 

2.04 


Table 3: Block mapping work distribution. 

Apart from grain size, another parameter used in the tests was the minimum cluster 
width. For instance, if the minimum cluster width is 4, no strip of columns less than 
four columns wide is acceptable as a cluster - it is broken up into individual columns. 
The larger the minimum width acceptable, the fewer number of non-single-column 
clusters there are. For any problem, if the cluster width is set high enough, we end 
up with all single columns. The results of table 2 and table 3 were obtained using a 
minimum cluster width of four. 

Table 4 shows the variation of communication and load distribution with minimum 
cluster width for LAP30. The table shows an increase in communication when the 
width goes from 2 to 4 and then a decrease when the width goes to 8. Load imbalance 
shows a complementary behavior. It decreases when the width goes from 2 from 4 
and then increases when the width goes from 4 to 8. The cluster width has to go in 
step with the grain size. If the cluster width is too small compared to the grain size, 
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a laxge number of skinny clusters would be formed towards the left of the matrix. 
The blocks would not have enough matrix elements to take advantage of reduction in 
communication offered by the large grain size. 



P 

Communication 

Work Distr. 

Total 

Mean 

Mean 

A I 

IHH 

4 

38936 

9734 

108644 

0.03 

2 

16 

96235 


■ 

0.167 


32 

111519 

3485 

■tSmSW 

0.54 


4 

38424 

9606 

108644 

0.12 

4 

16 

100012 

6251 

27161 

0.13 


32 

113717 

3554 

13580 

0.48 


4 

32569 

8142 

108644 

0.62 

8 

16 

88408 

5526 

27161 

1.35 


32 

101725 

3179 

13580 

2.3 


Table 4: Variation with width for LAP30, g — 4. 

Table 5 presents the results for the wrap-mapping case. The immediately noticeable 
property is the consistently uniform load distribution, as seen by the A column. How- 
ever, a smaller grain size in the block scheme gives a two-fold advantage of decrease 
in communication without too much load imbalance as compared to wrap-mapping. 
For instance, consider the CANN1072 problem with 32 processors. For a grain size 
of four, the block case provides a 28% saving in communication in going from wrap 
mapping to the block scheme while the load imbalance factor goes from 0.14 to 0.38, 
whereas when the grain size is 25, the savings in communication over wrap-mapping 
is 63% while the load imbalance factor goes from 0.14 to 1.26. 


5 Conclusions 


In this paper, we have described a block based, automatic partitioning and scheduling 
scheme for factoring sparse matrices on message passing systems. The primary focus 
is towards automating the process so that the tedious task of manual parallelization 
is kept to a minimum. The partitioner makes use of data locality to reduce communi- 
cation overhead and at the same time attempts to provide the necessary flexibility to 
the scheduler in manipulating the work allocation so that the load remains balanced. 
We have used the example of Cholesky factorization to describe the methodology. 
However, it can very easily be adapted to other factoring methods used in sparse 
matrix computations. In fact, it can be generalized to computations that can be 
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Appl. 

P 

Communication 

Work Distr. 

Total 

Mean 

Mean 

A 


1 

0 

HUB 

11164 

mm 

BUS 

4 

2485 


2791 

fm 


16 

3705 

mMm 

698 

HI 


32 

3832 

120 

349 

0.35 


1 

0 


IlifiKEgfil 

0 

CANN 

4 

52363 

1 

151460 

0.01 


16 

171764 


37865 

0.05 


32 

239646 

7489 

18932 

0.14 


1 

0 

0 

46804 

■■ 

DWT 

4 

7599 

1900 

11701 

iSl 


16 

17867 

1117 

2925 

EH 


32 


656 

1462 

0.32 


1 

0 

0 

434577 


LAP 

4 

42663 

10665 




16 

133720 

8357 

27161 



32 

177625 

5551 




1 

0 


501570 

0 

LSHP 

4 

46347 

11586 

125392 

0.01 


16 

146322 

9145 

31348 

0.09 


32 

192977 


15674 

0.24 


Table 5: Wrap mapping. 
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represented as directed acyclic graphs with sufficient information prior to performing 
the computations. 

To analyze the effects on the performance of the partitioning and scheduling tech- 
nique used, we have compared the communication overhead in the form of total data 
traffic with that obtained from an implementation where a straightforward column 
wrap scheme is used. Five representative test matrices from the Harwell-Boeing pack- 
age were used for this purpose. The comparison shows that the block-based scheme 
results in a significant reduction in the communication overhead as compared to the 
wrap-mapping scheme. This is in agreement with our motivation for blocking. On the 
other hand, the block method results in more load imbalance. Wrap-mappings usu- 
ally lead to processors communicating with a large number of other processors leading 
to a large amount of data traffic and possibly to hot-spots. However, in block-based 
schemes, most of the communication among blocks occur within a cluster and hence 
can mostly be confined to small groups of processors. Although the increased load 
imbalance is a serious issue, the provision of the parameters such as the grain size and 
the cluster widths allows one to minimize the load imbalance for particular applica- 
tions. Further study of the structure of the sparse matrices is required to optimize 
these parameters for individual applications. Moreover, in real applications factoring 
is only a part of the overall solution of the system and other computations such as 
triangular solves can provide additional flexibility in the balancing the load which is 
not taken into account here. Finally, more sophisticated scheduling strategies could 
be used to improve performance. Thus, for systems such as message passing archi- 
tectures, where communication overhead is much more expensive than computation, 
automated, block-based methods such as the one described here may prove to be 
better alternatives. 
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